<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
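	<title>Hyphenation decompounder token filter | ElasticSearch 7.7 Definitive Guide, Chinese Edition</title>
	<meta name="keywords" content="ElasticSearch Definitive Guide Chinese Edition, elasticsearch 7, es7, real-time data analysis, real-time data search" />
    <meta name="description" content="ElasticSearch Definitive Guide Chinese Edition, elasticsearch 7, es7, real-time data analysis, real-time data search" />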
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
	<script>
	var _link = 'analysis-hyp-decomp-tokenfilter.html';
    </script>
</head>
<body>
<div class="main-container">
    <section id="content">
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">Original English version: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-hyp-decomp-tokenfilter.html" rel="nofollow" target="_blank">https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-hyp-decomp-tokenfilter.html</a>. Copyright of the original document belongs to www.elastic.co.<br/>Local English version: <a href="../en/analysis-hyp-decomp-tokenfilter.html" rel="nofollow" target="_blank">../en/analysis-hyp-decomp-tokenfilter.html</a></div>
                        <!-- start body -->
                  <div class="page_header">
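<strong>Important</strong>: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" rel="nofollow">current release documentation</a>.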
</div>
<div id="content">
<div class="navheader">
<span class="prev">
<a href="analysis-hunspell-tokenfilter.html">« Hunspell token filter</a>
</span>
<span class="next">
<a href="analysis-keep-types-tokenfilter.html">Keep types token filter »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="analysis-hyp-decomp-tokenfilter"></a>Hyphenation decompounder token filter
</h2>
</div></div></div>

<p>Uses XML-based hyphenation patterns to find potential subwords in compound
words. These subwords are then checked against the specified word list. Subwords not
in the list are excluded from the token output.</p>
<p>This filter uses Lucene’s
<a href="https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilter.html" class="ulink" target="_top">HyphenationCompoundWordTokenFilter</a>,
which was built for Germanic languages.</p>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-hyp-decomp-tokenfilter-analyze-ex"></a>Example
</h3>
</div></div></div>
<p>The following <a class="xref" href="indices-analyze.html" title="Analyze API">analyze API</a> request uses the
<code class="literal">hyphenation_decompounder</code> filter to find subwords in <code class="literal">Kaffeetasse</code> based on
German hyphenation patterns in the <code class="literal">analysis/hyphenation_patterns.xml</code> file. The
filter then checks these subwords against a list of specified words: <code class="literal">Kaffee</code>,
<code class="literal">zucker</code>, and <code class="literal">tasse</code>.</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"]
    }
  ],
  "text": "Kaffeetasse"
}</pre>
</div>
<p>The filter produces the following tokens:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">[ Kaffeetasse, Kaffee, tasse ]</pre>
</div>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-hyp-decomp-tokenfilter-configure-parms"></a>Configurable parameters
</h3>
</div></div></div>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
<code class="literal">hyphenation_patterns_path</code>
</span>
</dt>
<dd>
<p>(Required, string)
Path to an Apache FOP (Formatting Objects Processor) XML hyphenation pattern file.</p>
<p>This path must be absolute or relative to the <code class="literal">config</code> location. Only FOP v1.2
compatible files are supported.</p>
<p>For example FOP XML hyphenation pattern files, refer to:</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
<a href="http://offo.sourceforge.net/#FOP+XML+Hyphenation+Patterns" class="ulink" target="_top">Objects For Formatting Objects (OFFO) Sourceforge project</a>
</li>
<li class="listitem">
<a href="https://sourceforge.net/projects/offo/files/offo-hyphenation/1.2/offo-hyphenation_v1.2.zip/download" class="ulink" target="_top">offo-hyphenation_v1.2.zip direct download</a>
</li>
</ul>
</div>
</dd>
<dt>
<span class="term">
<code class="literal">word_list</code>
</span>
</dt>
<dd>
<p>(Required*, array of strings)
A list of subwords. Subwords found using the hyphenation pattern but not in this
list are excluded from the token output.</p>
<p>You can use the <a class="xref" href="analysis-dict-decomp-tokenfilter.html" title="Dictionary decompounder token filter"><code class="literal">dictionary_decompounder</code></a>
filter to test the quality of word lists before implementing them.</p>
<p>Either this parameter or <code class="literal">word_list_path</code> must be specified.</p>
</dd>
<dt>
<span class="term">
<code class="literal">word_list_path</code>
</span>
</dt>
<dd>
<p>(Required*, string)
Path to a file containing a list of subwords. Subwords found using the
hyphenation pattern but not in this list are excluded from the token output.</p>
<p>This path must be absolute or relative to the <code class="literal">config</code> location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line break.</p>
<p>You can use the <a class="xref" href="analysis-dict-decomp-tokenfilter.html" title="Dictionary decompounder token filter"><code class="literal">dictionary_decompounder</code></a>
filter to test the quality of word lists before implementing them.</p>
<p>Either this parameter or <code class="literal">word_list</code> must be specified.</p>
</dd>
<dt>
<span class="term">
<code class="literal">max_subword_size</code>
</span>
</dt>
<dd>
(Optional, integer)
Maximum subword character length. Longer subword tokens are excluded from the
output. Defaults to <code class="literal">15</code>.
</dd>
<dt>
<span class="term">
<code class="literal">min_subword_size</code>
</span>
</dt>
<dd>
(Optional, integer)
Minimum subword character length. Shorter subword tokens are excluded from the
output. Defaults to <code class="literal">2</code>.
</dd>
<dt>
<span class="term">
<code class="literal">min_word_size</code>
</span>
</dt>
<dd>
(Optional, integer)
Minimum word character length. Shorter word tokens are not decompounded and
pass through unchanged. Defaults to <code class="literal">5</code>.
</dd>
<dt>
<span class="term">
<code class="literal">only_longest_match</code>
</span>
</dt>
<dd>
(Optional, boolean)
If <code class="literal">true</code>, only include the longest matching subword. Defaults to <code class="literal">false</code>.
</dd>
</dl>
</div>
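<p>As a rough illustration of how the optional parameters combine, the following
<a class="xref" href="indices-analyze.html" title="Analyze API">analyze API</a> request is a sketch that reuses the
pattern file and word list from the example above. The values chosen for
<code class="literal">min_word_size</code>, <code class="literal">min_subword_size</code>, <code class="literal">max_subword_size</code>, and
<code class="literal">only_longest_match</code> are assumptions picked for illustration only, not
recommended settings. The tokens returned depend on the pattern file you install.</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hyphenation_decompounder",
      "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
      "word_list": ["Kaffee", "zucker", "tasse"],
      "min_word_size": 6,
      "min_subword_size": 3,
      "max_subword_size": 10,
      "only_longest_match": true
    }
  ],
  "text": "Kaffeetasse"
}</pre>
</div>
<p>Changing one value at a time and comparing the responses is a simple way to see
what each parameter contributes.</p>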
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="analysis-hyp-decomp-tokenfilter-customize"></a>Customize and add to an analyzer
</h3>
</div></div></div>
<p>To customize the <code class="literal">hyphenation_decompounder</code> filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.</p>
<p>For example, the following <a class="xref" href="indices-create-index.html" title="Create index API">create index API</a> request
uses a custom <code class="literal">hyphenation_decompounder</code> filter to configure a new
<a class="xref" href="analysis-custom-analyzer.html" title="Create a custom analyzer">custom analyzer</a>.</p>
<p>The custom <code class="literal">hyphenation_decompounder</code> filter finds subwords based on hyphenation
patterns in the <code class="literal">analysis/hyphenation_patterns.xml</code> file. The filter then checks
these subwords against the list of words specified in the
<code class="literal">analysis/example_word_list.txt</code> file. Subwords longer than 22 characters are
excluded from the token output.</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">PUT hyphenation_decompound_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_hyphenation_decompound": {
          "tokenizer": "standard",
          "filter": [ "22_char_hyphenation_decompound" ]
        }
      },
      "filter": {
        "22_char_hyphenation_decompound": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/example_word_list.txt",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "max_subword_size": 22
        }
      }
    }
  }
}</pre>
</div>
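<p>To try the analyzer once the index exists, a request along the following lines
can be sent to the <a class="xref" href="indices-analyze.html" title="Analyze API">analyze API</a>. This is a sketch
that assumes the index above was created successfully and that both referenced
files exist under the <code class="literal">config</code> location. The contents of
<code class="literal">analysis/example_word_list.txt</code> are not shown in this guide. A hypothetical
version that reuses the words from the earlier example, with one subword per
line in UTF-8, could look like this:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">Kaffee
zucker
tasse</pre>
</div>
<p>With such a file in place, the custom analyzer can be exercised by name:</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">GET hyphenation_decompound_example/_analyze
{
  "analyzer": "standard_hyphenation_decompound",
  "text": "Kaffeetasse"
}</pre>
</div>
<p>The tokens returned depend on the actual contents of the word list and pattern
files.</p>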
</div>

</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>